Daghestanian loans database

Authors: Ilya Chechuro, Michael Daniel, and Samira Verhees.

The DagLoans Database, edited by Ilya Chechuro, Michael Daniel, Nina Dobrushina and Samira Verhees, is a scientific resource created by the Linguistic Convergence Laboratory of HSE University, Moscow (2019). The project was prepared within the framework of the Basic Research Program at the National Research University Higher School of Economics (HSE) and supported by a subsidy from the Russian Academic Excellence Project ‘5-100’. The database contains wordlists collected as part of the Daghestanian loans project by the Linguistic Convergence Laboratory at NRU HSE. The 146-item shortlist, which is based on the World Loanword Database questionnaire, is designed to measure lexical contact at the micro level: that is, to quantify lexical convergence among the speech communities of minority languages at the village level, and to detect fine-grained areal patterns that go beyond general observations about the spheres of influence of particular languages.

The database provides wordlists of 146 lexical meanings collected from 147 sources (dictionaries and speakers) of 23 languages. The set of lexical meanings used in this database is a subset of the WOLD Database wordlist (Haspelmath and Tadmor 2009); it uses the same lexical entries but a different set of IDs. The database can be linked to any other lexical database upon agreement with the authors. The language sample includes languages of Daghestan as well as languages spoken outside Daghestan that are relevant for the study of lexical influence in this region, such as Persian, Russian and Arabic. For now, the table shows source Concepts and target Words. Each target word is assigned to a similarity Set, that is, a set of words that have the same meaning and look similar. In the future, data will be added on borrowing sources. Metadata includes the name of the Village where the word was recorded, the administrative District it is part of, the Language spoken there, and the List ID: these IDs correspond to a particular speaker or, in some cases, to a written source such as a dictionary. The data are accessible at: Github/LingConLab/DagloanDatabase. The dataset in the dummy format is available here.

The Daghestanian Loans project studies the lexical influence of different languages in Daghestan at the micro level, i.e. at a level of granularity that is sensitive to differences between village varieties. Data from the project on multilingualism in Daghestan show that the conditions and the degree of language contact are unique to each village. Our aim is to discover the lexical correlates of these differences. For this purpose, we compiled a wordlist of 146 concepts for cross-linguistic comparison and developed a method for quick data collection in the field. Using a fixed list of concepts for comparison allows us to find the quantitative correlates of qualitative differences between areas, such as the spread of a certain lingua franca, the presence and degree of contact with particular languages, and migratory processes.

Collecting data in neighboring villages allows us to show variation between villages on the map, and it reveals the contours of various zones of influence for specific L2s. For example, lexical influence of local Turkic languages (Azerbaijani, Kumyk and Nogai) is found throughout Daghestan. In the south, however, where Azerbaijani served as a lingua franca for a long time, this influence is much stronger. In the north of Daghestan, bilingualism with Turkic languages was not common, and almost all Turkic borrowings in minor local languages are shared with Avar, a major native language. Turkic influence in the north was thus most likely mediated by Avar. Our first paper (currently in the final stages of preparation) details how we can detect different zones by comparing lexical samples from villages and major neighboring languages.
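The mediation argument can be illustrated with a small sketch in Python. It is not the procedure used in the paper; it only shows one possible way to express "almost all Turkic borrowings are shared with Avar" as a number. The function name `share_mediated` and all loan labels are hypothetical and are not taken from the database.

```python
from typing import Optional, Set


def share_mediated(minor_turkic: Set[str], mediator_turkic: Set[str]) -> Optional[float]:
    """Proportion of a minor variety's Turkic borrowings also found in the mediator.

    Returns None when the minor variety has no Turkic borrowings at all.
    """
    if not minor_turkic:
        return None
    return len(minor_turkic & mediator_turkic) / len(minor_turkic)


# Toy labels for illustration only (assumption: each label is one Turkic
# borrowing attested in the given word list).
avar = {"loan_a", "loan_b", "loan_c", "loan_d"}
northern_village = {"loan_a", "loan_b", "loan_c"}
southern_village = {"loan_a", "loan_e", "loan_f", "loan_g"}

north_share = share_mediated(northern_village, avar)
south_share = share_mediated(southern_village, avar)
print(north_share, south_share)
```

A share close to 1 is compatible with mediation through the major language, while a low share suggests direct contact with a Turkic language.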

Contents:

target words: 25796
languages: 23

How to cite this project

If you use data from the database in your research, please cite as follows:

Chechuro I., Daniel M., Dobrushina N., and Verhees S. 2019. Daghestanian loans database. Linguistic Convergence Laboratory, HSE. (Available online at https://lingconlab.github.io/Dagloan_database/DL_database.html, DOI, accessed on June 05, 2019.)

The database

For now, the table shows source Concepts and target Words. Each target word is assigned to a similarity Set, that is, a set of words that have the same meaning and look similar. In the future, data will be added on borrowing sources. Metadata includes the name of the Village where the word was recorded, the administrative District it is part of, the Language spoken there, and the List ID: these IDs correspond to a particular speaker or, in some cases, to a written source such as a dictionary. The data are accessible at: Github/LingConLab/DagloanDatabase.
The dataset in the dummy format is available here.


Version: 2019-06-05. For questions or comments contact jh.verhees@gmail.com.


Map of the surveyed villages

Hover over or click on a dot on the map to learn more. The color of a dot corresponds to the number of lists collected in that village. Orange = dictionary data.

Sample lexical map

The map below shows the distribution of different stems for the concept ‘pepper’.

Sources of lexical influence

Cluster Dendrogram of Foreign Influence

This tree is built from pairwise comparisons of word lists, cell by cell. A pair of cells gets a distance of 0 only if both cells are non-empty and match; otherwise the distance is 1. Cells with NA values are not counted.
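A minimal Python sketch of this distance rule follows. Several details are assumptions, since the text does not spell them out: NA is represented as `None`, an empty cell as an empty string, each cell holds a similarity-set label, and the per-cell distances are averaged over the compared cells. The function name `lenient_distance` and the labels are hypothetical.

```python
from typing import Optional, Sequence


def lenient_distance(a: Sequence[Optional[str]], b: Sequence[Optional[str]]) -> Optional[float]:
    """Mean cell distance between two word lists, skipping cells with NA."""
    if len(a) != len(b):
        raise ValueError("both lists must cover the same concepts")
    compared = 0
    mismatches = 0
    for x, y in zip(a, b):
        if x is None or y is None:
            continue  # NA in either list: the cell is not counted
        compared += 1
        # Distance 0 only for two matching non-empty cells, otherwise 1.
        if not (x != "" and x == y):
            mismatches += 1
    if compared == 0:
        return None  # no comparable cells
    return mismatches / compared


# Toy lists (hypothetical labels): one match, one mismatch, two NA cells
# that are skipped, and one pair of empty cells that counts as distance 1.
list_a = ["set_1", "set_2", None, "", "set_5"]
list_b = ["set_1", "set_3", "set_4", "", None]
d = lenient_distance(list_a, list_b)
print(d)
```

Averaging over the compared cells keeps lists with many gaps comparable to complete lists; whether the original analysis normalizes in this way is not stated.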

[Table output truncated: columns Speaker, Language, Village, District, followed by one column per word list, from Alibeglo1 to Zilo2; 125 rows omitted.]

Cluster Dendrogram of Foreign Influence (Strict Distances)

This tree is built from pairwise comparisons of word lists, cell by cell. A pair of cells gets a distance of 0 only if both cells are non-empty and match; otherwise the distance is 1. Here, cells with NA values are counted, which leads to very large distances even between similar speakers.
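The effect of counting NA cells can be seen in a small Python sketch. As before, the representation is an assumption: NA is `None`, an empty cell is an empty string, and the per-cell distances are averaged over all cells. The function name `strict_distance` and the labels are hypothetical.

```python
from typing import Optional, Sequence


def strict_distance(a: Sequence[Optional[str]], b: Sequence[Optional[str]]) -> float:
    """Mean cell distance between two word lists, with NA cells counted as 1."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both lists must be non-empty and cover the same concepts")
    mismatches = 0
    for x, y in zip(a, b):
        # Distance 0 only for two matching non-empty cells; NA or empty cells
        # and differing labels all contribute a distance of 1.
        matching = x is not None and x != "" and x == y
        if not matching:
            mismatches += 1
    return mismatches / len(a)


# Two toy lists that agree wherever both are filled in, but contain gaps.
sparse_a = ["set_1", "set_2", None, None]
sparse_b = ["set_1", "set_2", None, "set_4"]
between = strict_distance(sparse_a, sparse_b)
with_itself = strict_distance(sparse_a, sparse_a)
print(between, with_itself)
```

Under this rule a list with gaps is at a non-zero distance even from itself, which illustrates why the strict tree shows large distances between otherwise similar speakers.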

[Table output truncated: columns Speaker, Language, Village, District, followed by one column per word list, from Alibeglo1 to Zilo2; 125 rows omitted.]

Mediation of Turkic influence (Speakers)

Mediation of Turkic influence (Villages)

Mediation of Total Turkic Influence

Mediation of Standard Azerbaijani Influence

Mediation of Turkic Influence via Major Languages

    Speaker Language  Village District        Lexeme Present
1 Alibeglo1 Georgian Alibeglo      Qax the_beeswax_9       0
2   Arkhit1  Lezgian   Arkhit     Khiv the_beeswax_9       0
3   Arkhit2  Lezgian   Arkhit     Khiv the_beeswax_9       0
4   Arkhit3  Lezgian   Arkhit     Khiv the_beeswax_9       0
5   Arkhit4  Lezgian   Arkhit     Khiv the_beeswax_9       0
6   Arkhit5  Lezgian   Arkhit     Khiv the_beeswax_9       0
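The long-format rows above (one row per speaker and lexeme, with a 0/1 Present flag) can be reshaped into a speaker-by-lexeme table before lists are compared. The Python sketch below shows one way to do this. The first three rows repeat values from the table above; the rows with `toy_lexeme_1` are invented for illustration, and the function name `to_wide` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

rows: List[dict] = [
    # Values copied from the table above.
    {"Speaker": "Alibeglo1", "Lexeme": "the_beeswax_9", "Present": 0},
    {"Speaker": "Arkhit1", "Lexeme": "the_beeswax_9", "Present": 0},
    {"Speaker": "Arkhit2", "Lexeme": "the_beeswax_9", "Present": 0},
    # Invented rows, not taken from the database.
    {"Speaker": "Alibeglo1", "Lexeme": "toy_lexeme_1", "Present": 1},
    {"Speaker": "Arkhit1", "Lexeme": "toy_lexeme_1", "Present": 0},
]


def to_wide(long_rows: List[dict]) -> Dict[str, Dict[str, int]]:
    """Map each speaker to a dictionary of lexeme -> presence flag."""
    wide: Dict[str, Dict[str, int]] = defaultdict(dict)
    for row in long_rows:
        wide[row["Speaker"]][row["Lexeme"]] = row["Present"]
    return dict(wide)


wide = to_wide(rows)
lexemes = sorted({row["Lexeme"] for row in rows})
for speaker in sorted(wide):
    # Speaker/lexeme combinations without a row are shown as None (NA).
    print(speaker, [wide[speaker].get(lexeme) for lexeme in lexemes])
```

Keeping missing combinations as NA rather than 0 preserves the difference between a lexeme that is absent from a list and a lexeme that was never recorded.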

References

This web-page was created using the following R packages:

Auguie, Baptiste. 2017. GridExtra: Miscellaneous Functions for “Grid” Graphics. https://CRAN.R-project.org/package=gridExtra.

Barnier, Julien. 2019. Rmdformats: HTML Output Formats and Templates for ’Rmarkdown’ Documents. https://CRAN.R-project.org/package=rmdformats.

Boettiger, Carl. 2017. Knitcitations: Citations for ’Knitr’ Markdown Files. https://CRAN.R-project.org/package=knitcitations.

Galili, Tal. 2015. “Dendextend: An R Package for Visualizing, Adjusting, and Comparing Trees of Hierarchical Clustering.” Bioinformatics. doi:10.1093/bioinformatics/btv428.

Gehlenborg, Nils. 2017. UpSetR: A More Scalable Alternative to Venn and Euler Diagrams for Visualizing Intersecting Sets. https://CRAN.R-project.org/package=UpSetR.

Haspelmath, Martin, and Uri Tadmor. 2009. Loanwords in the World’s Languages: A Comparative Handbook. Walter de Gruyter.

Moroz, George. 2017. Lingtypology: Easy Mapping for Linguistic Typology. https://CRAN.R-project.org/package=lingtypology.

R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Sievert, Carson. 2018. Plotly for R. https://plotly-r.com.

Suzuki, Ryota, and Hidetoshi Shimodaira. 2015. Pvclust: Hierarchical Clustering with P-Values via Multiscale Bootstrap Resampling. https://CRAN.R-project.org/package=pvclust.

Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.

———. 2017. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.

Xie, Yihui. 2014. “Knitr: A Comprehensive Tool for Reproducible Research in R.” In Implementing Reproducible Computational Research, edited by Victoria Stodden, Friedrich Leisch, and Roger D. Peng. Chapman; Hall/CRC. http://www.crcpress.com/product/isbn/9781466561595.

———. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. https://yihui.name/knitr/.

———. 2019. Knitr: A General-Purpose Package for Dynamic Report Generation in R. https://yihui.name/knitr/.

Xie, Yihui, Joe Cheng, and Xianying Tan. 2019. DT: A Wrapper of the Javascript Library ’Datatables’. https://CRAN.R-project.org/package=DT.
